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1.  Overview 


During  the  first  year  of  the  award  period,  the  Computer  Science  Department 
of  the  University  of  Rochester  has  pursued  three  main  lines  of  work:  systems 
support  algorithms.  Butterfly  programming  environment,  and  vision  applications. 
The  work  under  the  Strategic  Computing  grant  meshes  nicely  with  the  work  done 
under  our  DARPA  Image  Understanding  grant.  The  IU  grant  supports  more 
fundamental  work  whose  technological  implications  are  explored  under  the  SC 
grant  The  SC  grant  is  mainly  concerned  however  with  rendering  today’s 
Multiprocessor  computer  architectures  useful.  Often  work  is  jointly  funded,  and 
the  references  at  the  end  of  this  report  includes  some  work  done  with  the  help  of 
other  funding.  Some  of  this  related  work  is  described  briefly  below  . 

Today’s  multiprocessor  computer  architectures  are  not  efficiently  programmed 
or  even  conceptualized  with  standard  computer  languages,  and  their  operating 
systems  and  debugging  tools  are  also  challengingly  different  Of  all  the  SC  Vision 
contractors.  Rochester  seems  to  be  doing  the  most  work  in  the  area  of  tools  for 
controlling  large-grain  parallelism,  as  one  finds  in  a  distributed  multiprocessor 
application  like  the  ALV,  or  in  tightly  coupled  processors  like  the  Hypercube  or 
the  Butterfly  Parallel  Processor. 


The  references  show  two  new  technical  report  series.  The  Butterfly  Project 
Report  and  Hierarchical  Process  Composition  Project  Report  series  document  our 
work  on  the  Butterfly  Parallel  Processor  and  in  the  HPC  computational  model. 


2.  The  Butterfly  and  the  Uniform  System 

The  University  of  Rochester  operates  three  Butterfly  configurations  with  3, 16 
and  120  processors.  These  machines  were  procured  with  the  help  of  an  NSF 
Coordinated  Experimental  Research  grant  Each  processor  consists  of  a  68000- 
based  microcomputer  connected  to  a  bit-slice  microengine.  The  microengine  in 
turn  is  connected  to  local  memory  (one  megabyte  per  processor),  local  i/o 
(synchronous  and  asynchronous  serial  lines,  plus  an  optional  multibus),  and  a 
high-speed  switch.  The  microengine  maps  all  memory  references  made  by  the 
68000  into  either  local  memory,  local  i/o,  microengine  control  registers  or  (via  the 
switch)  some  other  processor’s  local  memory.  Remote  memory  references  take  two 
to  four  microseconds  in  the  absence  of  contention,  compared  to  half  a 
microsecond  for  local  references.  Alternatively,  the  68000  can  direct  its 
microengine  to  execute  DMA  transfers  between  local  and  remote  memory  at  4 
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megabytes/second.  The  operating  system  supports  multitasking  on  each  node  and 
provides  a  variety  of  microcode-assisted  message  passing  and  synchronization 
primitives. 

For  much  of  our  Butterfly  vision  work  we  use  the  Uniform  System,  a 
software  package  that  restricts  the  generality  of  the  machine  a  great  deal  but 
simplifies  coding.  When  a  Uniform  System  program  is  initialized  it  starts  a  special 
process  called  a  task  demon  on  every  processor  on  the  machine.  It  then  allocates 
12  megabytes  of  memory  (taking  equal  amounts  from  the  local  memory  of  each 
processor)  and  arranges  for  all  processes  to  map  that  memory  into  the  same  place 
in  their  address  spaces.  This  makes  it  possible  for  them  to  pass  pointers  to  each 
other.  The  original  process  can  call  generator  functions  to  place  task  descriptions 
on  a  global  agenda,  where  they  are  picked  up  and  executed  by  the  task  demons. 
Tasks  can  call  generators  recursively.  The  generators  are  typically  used  to  simulate 
parallel  FOR  loops.  For  example,  a  task  description  might  consist  of  a  work 
function  and  an  index;  task  demons  would  atomically  read  and  decrement  the 
index  and  call  the  work  function  with  the  result  of  the  read.  Uniform  System 
applications  tend  to  have  a  S1MD  flavor,  but  the  system  differs  from  true  SIMD 
machines  in  that  the  "single  instruction"  performed  can  be  an  arbitrary  function 
call. 

3.  Systems  Support  Algorithms 

The  work  by  Sanchis  [1986]  and  Newman-Wolfe  [1985;  1986]  is  actually 
theoretical  computer  science  brought  to  bear  on  MIMD  architectures.  This 
research  has  contributed  algorithms  for  enhancing  reliability  and  reducing 
contention  on  the  butterfly  switch,  and  on  several  other  topologies  that  exist  in 
current  MIMD  computers. 

Liudvikas  Bukys  has  provided  vital  system  support  and  applications  coding 
throughout  the  funding  period.  His  contributions  include  network  software, 
operating  system  installations  and  updates,  a  fast  union-find  algorithm  useful  in 
vision  applications  for  region-growing  and  edge-linking,  and  general  expertise  with 
all  levels  of  the  Butterfly  system. 

4.  Systems  Research  and  Programming  Environments 

Our  systems  research  has  concentrated  on  three  main  areas:  Basic  building 
blocks.  Programming  environments,  and  Programming  models.  [BPR  series,  HPC 
series,  and  papers  by  LeBlanc,  Scott,  Finkel  et  al,  Friedberg,  Ohkami  and 
Baldwin]. 

The  basic  building  blocks  are  individual  pieces  of  work  that  represent 
improvements  to  the  existing  Butterfly  systems.  They  can  be  incorporated  in  later 
software  systems  or  can  be  pursued  for  their  own  interest  without  the  commitment 
to  integrate  them  into  a  permanent  system.  Our  work  has  included;  Uniform 
System  modifications,  a  parallel  first  fit  memory  allocation,  measurement  of 
Chrysalis  primitives,  SAR  management  to  increase  the  number  of  available 
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processes,  and  the  Modula-2  and  C++  languages  as  natural  extensions  to  the  C 
environment 

A  viable  parallel  programming  environment  is  a  basic  goal  of  our  SC  Vision 
effort  To  date  we  have  worked  on:  a  parallel  file  system,  parallel  compilation, 
debugging  tools,  and  have  developed  and  commissioned  the  SMP  library  and  the 
LYNX  programming  language. 


4.1.  Structured  Message  Passing 

SMP  is  a  model  of  parallel  programs  that  consists  of  process  families  whose 
members  are  created  and  destroyed  together,  and  interprocess  communication, 
within  a  family,  based  on  asynchronous  message-passing  according  to  a  fixed 
communication  topology.  There  is  a  dynamic  hierarchy  of  process  families.  SMP 
has  been  implemented  [Leblanc,  BPR-8].  The  goals  of  SMP  are:  A  user-level 
programming  environment  independent  of  the  number  of  physical  processors. 
Reduced  coding  overhead  associated  with  creating  a  set  of  processes. 
Communication  connections  between  processes  according  to  a  given 
interconnection  pattern  and  naming  scheme.  Investigation  of  the  utility  of 
asynchronous  message-passing  as  the  primary  form  of  interprocess  communication 
within  the  Butterfly,  and  Comparison  of  message-passing  with  shared  memory  for 
various  applications. 

SMP  offers  a  message-based  alternative  to  the  Uniform  System  package.  Its 
heavyweight  process  model  complements  the  lightweight  process  model  of 
Modula-2.  It  gives  us  additional  experience  with  a  dynamic  hierarchy  of  processes 
that  use  asynchronous  message-passing. 

The  user  interface  is  a  handful  of  subroutine  calls. 

/*  Create  a  family  of  communicating  processes  V 
SMP_family_id  SMP_Spawn  (topology,  codef,  argvf,  data) 

/*  Destroy  all  processes  in  a  family  V 
void  SMP_Kill  (family_id) 

/*  Send  a  message  to  a  set  of  processes  in  the  same  family  V 
int  SMP_Send  (family,  dests,  num.dest,  buffer,  size) 

/*  Receive  a  message  from  a  process  */ 

int  SMP_Receive  (family,  srcs,  num_srcs,  buffer,  size,  sender) 

Additional  library  routines  can  be  used  to  create  predefined  interconnection 
topologies  or  query  a  particular  topology. 

The  Butterfly  implementation  of  SMP  message-passing  is  as  follows:  message 
buffers  are  implemented  using  shared  memory  objects;  buffers  are  written  by  a 
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single  process  (the  sender),  but  read  by  many;  a  count  of  intended  recipients  is 
associated  with  each  message;  message  buffers  are  dynamically  mapped  into 
address  spaces  using  a  SAR  cache;  In  our  experience  so  far,  SMP  is  easy  to  learn, 
its  performance  is  very  good,  and  the  SAR  cache  idea  is  useful  in  practice.  In  the 
future  we  plan  to  explore  new  applications,  to  experiment  with  message-passing 
implementations,  to  integrate  SMP  with  Modula-2,  and  to  let  SMP  cooperate  with 
Instant  Replay. 

4.2.  Instant  Replay 

Debugging  parallel  programs  is  difficult  There  are  multiple  threads  of 
control,  there  is  non-repeatable  behavior  due  to  timing  variations,  there  is  wide- 
ranging  granularity  of  communication,  and  a  lack  of  multiprocess  debugging  tools. 
In  this  work  we  assume;  programs  are  communication-intensive,  there  is  no  direct 
microcode  support  for  debugging,  and  programs  use  full  Chrysalis  capabilities. 
Our  goal  is  to  develop  a  flexible  monitoring  system  that  provides  Instant  Replav  of 
parallel  programs  with  minimal  impact  on  program  performance,  minimal 
information  collected,  distributed  data  collection,  independence  from  particular 
machine  models,  for  both  loosely-coupled  and  tightly-coupled  domains  Instant 
Replay  has  been  implemented  [Leblanc  and  Mellor-Crummey,  in  preparation]. 

The  crucial  observation  is  that  data  values  in  a  shared  object  depend  only 
upon:  initial  values  in  the  object,  the  deterministic  nature  of  the  processes 
operating  on  the  object,  the  underlying  virtual  machine,  the  inputs  to  processes 
from  the  external  environment,  and  the  relative  order  of  operations  on  the  object 
Thus  Instant  Replay  saves  the  relative  order  of  events  during  program  execution 
rather  that  the  data  associated  with  each  event  Each  time  a  shared  object  changes 
state,  it  is  assigned  a  new  version  number.  Each  time  a  shared  object  is  accessed, 
that  fact  is  recorded  on  an  object  history  tape.  Each  process  records  the  version 
number  of  every  shared  object  it  accesses.  For  debug^ng,  each  process  replays  its 
own  history  tape.  Each  process  that  attempts  to  access  a  shared  object  must  wait 
until  the  correct  version  of  the  object  is  available. 

A  prototype  SMP  implementation  with  Instant  Replay  has  been  built  The 
relevant  actions  for  Send  Message  are:  find  buffer,  Acquire  Write  Lock  (buffer), 
copy  message  into  buffer,  set  number  of  recipients.  Release  Write  Lock  (buffer). 
For  Receive  Message  they  are:  Acquire  Read  Polling  Lock  (iterator),  poll 
incoming  message  buffers,  copy  message  into  user  area.  Release  Polling  Lock, 
Acquire  Write  Lock  (buffer),  decrement  number  of  recipients,  Release  Write  Lock 
(buffer).  In  our  experience  so  far,  Instant  Replay  adds  minimal  overhead  to  the 
running  time  of  a  program.  It  is  a  practical  tool  that  does  not  produce  excessive 
contention.  There  are  some  locks  involved,  and  it  is  their  placement  that  the 
efficiency  of  the  system  resides.  The  rules  are  simple:  lock  concurrent  structures 
(not  shared  structures),  and  optimize  locks  on  idempotent  (read)  operations.  In 
the  future  we  shall  work  on:  code  efficiency  improvements,  proof  of  correctness, 
empirical  studies  with  different  programming  models,  automatic  instrumentation, 
integration  into  programming  environment,  exploring  effect  on  programming 
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cycle. 

4.3.  LYNX 

The  Butterfly  implementation  of  LYNX  is  complete.  A  partial 
implementation  was  made  publically  available  on  15  March.  Exception  handling 
was  installed  on  12  June.  Pointers  were  installed  on  14  July.  The  exception¬ 
handling  features  resemble  those  of  Ada.  The  pointer  facilities  allow  LYNX 
processes  to  share  access  to  blocks  of  Butterfly  memory.  Exceptions  can  be  bound 
to  Chrysalis  throw  codes  in  such  a  way  that  throws  in  external  C  routines 
propagate  back  into  LYNX  programs  as  bona-fide  exceptions. 

Library  packages  have  been  written  to  support:  storage  of  links  in  the 
Chrysalis  name  table,  access  to  environment  variables,  pointer-based  shared  object 
support,  screen-oriented  output 

Utility  programs  provide  access  to  files  on  the  host  machine  and  support 
interactive  manipulation  (from  the  Butterfly  shell)  of  name-table  links. 

We  have  begun  actively  to  encourage  the  growth  of  a  user  community.  We 
have  built  a  distributed  game-tree  searching  program  and  are  in  the  process  of 
benchmarking  it  We  should  be  able  to  report  on  the  results,  together  with  further 
applications,  in  the  fall  of  1986. 

>From  an  initial  time  of  5.9  milliseconds  (best  case),  we  have  reduced  the  cost 
of  message  passing  to  2.4  ms  per  remote  operation.  We  are  considering  protocol 
optimizations  that  should  bring  the  cost  to  well  below  2  ms. 

A  detailed  language  rationale  for  LYNX  will  appear  in  the  December  1986 
issue  of  IEEE  Transactions  of  Software  Engineering.  A  technical  note  on  the 
LYNX  type-checking  mechanism  will  appear  in  the  same  journal  at  a  later  date. 
A  paper  comparing  the  Butterfly  version  of  LYNX  with  two  earlier 
implementations  will  be  presented  at  the  1986  International  Conference  on  Parallel 
Processing.  Earlier  versions  of  the  language  rationale  and  of  the  implementation 
paper  were  published  as  Rochester  Technical  Reports  and  Butterfly  Project 
Reports.  The  implementation  paper  will  appear  in  the  Department’s  1986-87 
Research  Review.  A  LYNX  reference  manual  was  published  as  BPR  7. 

4.4.  PSYCHE 

We  have  begun  work  on  an  operating  system  for  the  Butterfly,  under  the 
joint  direction  of  Profs.  LeBlanc  and  Scott.  The  project  is  known  as  Psyche,  and 
has  as  its  goal  the  development  of  systems  software  to  support  a  wide  variety  of 
parallel  programming  paradigms. 

Psyche  will 

(1)  be  more  convenient  than  Chrysalis, 

(2)  support  multiple  users  on  a  single  Butterfly, 


(3)  permit  experimentation  with  shared  memory,  message  passing,  and  options  in 
between,  and 

(4)  permit  well-structured  communication  between  pieces  of  an  application  that 
use  different  paradigms. 

We  are  convinced  that  no  one  model  of  parallelism  will  prove  appropriate  for 
all  applications.  Some  algorithms  are  easier  to  implement  with  fully-shared 
memory.  Others  are  most  clearly  conceived  with  message  passing.  Still  others 
need  an  option,  such  as  monitors,  that  falls  somewhere  in-between. 

A  major  thrust  of  our  work  so  far  has  been  the  comparison  of  solutions  to 
common  problems  under  various  programming  models  (see,  for  example,  BPR  3). 
We  are  fortunate  with  the  Butterfly  to  be  using  hardware  that  lacks  a  built-in  bias 
toward  any  one  of  these  approaches.  We  hope  with  Psyche  to  exploit  this  lack  of 
bias  to  develop  an  operating  system  that  allows  each  application,  or  part  of  an 
application,  to  be  written  under  the  programming  model  most  appropriate  for  its 
own  particular  needs.  We  expect  Psyche  to  provide  simple,  well-defined 
mechanisms  for  interaction  between  pieces  of  code  that  employ  different  models. 

The  fundamental  concept  in  Psyche  is  the  realm.  On  the  Butterfly,  a  realm 
will  be  a  memory  object  (or  set  of  objects)  with  an  associated  protocol  that  governs 
its  access.  There  will  be  a  many-to-many  relationship  between  processes  and 
realms:  each  realm  may  be  shared  by  an  arbitrary  number  of  processes  and  each 
process  may  have  access  to  an  arbitrary  number  of  realms.  Psyche  will  provide 
mechanisms  for  creating,  destroying,  and  managing  access  to  realms.  In  effect,  it 
will  allow  a  program  to  bind  access  protocols  and  protection  mechanisms  to  sets  of 
memory  objects. 

Examples  of  access  protocols  for  realms  include: 

(1)  Pure  shared  memory  in  the  style  of  the  BBN  Uniform  System.  A  single, 
large,  static  realm  would  be  shared  by  all  processes.  The  access  protocol 
would  permit  unrestricted  reads  and  writes  of  individual  memory  cefls. 

(2)  Connection-less  message  passing.  Each  message  would  be  a  separate  realm. 
To  send  a  message  one  would  make  the  realm  accessible  to  the  receiver  and 
(probably)  inaccessible  to  the  sender. 

(3)  Connection-based  message  passing,  in  the  style  of  LYNX.  Each 
communication  channel  (link)  would  be  accessible  to  two  processes  and 
would  contain  buffers  through  which  they  could  communicate. 

(4)  Monitors.  Each  realm  would  have  access  routines  and  a  monitor  lock.  The 
realm  protocol  would  insist  that  the  access  routines  acquire  the  monitor  lock 
before  execution. 

(5)  Path  expressions.  Analogous  to  monitors,  each  realm  would  have  access 
routines  and  an  access  protocol  that  would  enforce  the  ordering  rules 
described  by  path  expressions. 
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One  major  issue  in  the  design  of  Psyche  will  be  protocol  enforcement  Our 
goal  is  to  make  the  probability  of  accidental  misuse  of  a  domain  acceptably  small. 
We  have  no  intention  of  making  it  impossible.  Protection  mechanisms  at  our 
disposal  include 

(1)  Manipulation  of  memory  maps  to  make  domains  unaddressable. 

(2)  Manipulation  of  protection  bits  to  make  domains  unreadable,  unwritable,  or 

unexecutable. 

(3)  Modification  of  compilers  to  make  inappropriate  domain  operations 

unexpressible. 

The  first  two  options  would  be  much  more  attractive  in  the  presence  of  the 
68020  Butterfly  upgrade.  In  addition  to  the  oft-cited  advantages  of  increased 
speed  and  hardware  floating  point,  the  68020s  would  permit  restarting  of 
instructions,  allowing  us  to  handle  protection  faults  and  implement  virtual 
memory. 

4.5.  CONSUL 

The  CONSUL  project  is  an  attempt  to  simplify  the  use  of  multi-processor 
computers  for  general-purpose  programming  through  automatic  detection  of 
parallelism  in  programs.  Current  programming  techniques  for  multi-processors 
require  programmers  to  worry  about  two  related  but  distinct  issues:  how  to 
express  a  solution  to  their  problem  as  a  program,  and  how  to  partition  this 
program  into  parallel  pieces.  Multi-processor  computers  will  never  be  as  easy  to 
program  as  sequential  ones  until  programmers  are  freed  from  the  need  to 
parallelize  programs  manually.  We  believe  that  the  best  solution  to  this  problem  is 
to  develop  compilers  that  will  automatically  detect  and  exploit  parallelism  in 
programs  that  have  not  been  explicitly  parallelized  by  their  authors.  Unfortunately, 
automatic  parallelization  is  a  difficult  problem  that  has  so  far  resisted  any  general 
solution.  One  of  the  main  reasons  is  that  the  source  languages  people  are  trying  to 
parallelize  are  inadequate.  Standard  imperative  languages  rely  on  side-effects  to 
maintain  the  state  of  a  computation,  a  problem  that  is  compounded  by  aliasing 
(the  same  piece  of  state  information  can  have  many  distinct  names).  These  features 
make  imperative  languages  impossible  to  parallelize  except  in  a  few  limited  areas 
(e.g.,  the  extremely  regular  code  found  in  scientific  computations).  Declarative 
languages,  which  are  generally  free  of  side-effects  and  have  a  more  tractable 
mathematical  foundation  (important  in  reasoning  about  both  programs  written  in 
them  and  legal  ways  of  compiling  those  programs),  are  more  promising  starting 
points  for  automatic  parallelization.  Our  research  is  thus  focussed  on  the 
compilation  of  a  particular  class  of  declarative  language  into  a  form  that  can  be 
efficiently  executed  on  multi-processors  such  as  the  Butterfly. 

In  late  1985  we  realized  that  {it  constraint  languages}  were  promising  ones 
with  which  to  work.  A  constraint  language  is  one  in  which  programs  consist  of  sets 
of  relations  between  inputs,  outputs,  and  (possibly)  intermediate  values,  such  that 
the  relations  hold  if  and  only  if  the  output  values  are  correct  for  the  inputs.  Note 
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that  there  is  a  dose  correspondence  between  constraint  languages  and  logic 
languages:  any  relation  in  a  constraint  program  can  be  replaced  by  a  predicate  that 
tests  whether  that  relation  holds,  and  any  predicate  in  a  logic  program  defines  a 
relation  between  its  arguments.  There  is,  however,  an  important  ‘qualitative 
distinction  between  our  prototype  constraint  language  and  other  logic  languages: 
We  provide  a  much  richer  set  of  primitive  relations  than  other  logic  languages,  in 
the  belief  that  doing  so  makes  the  expression  of  general  algorithms  and  their 
potential  parallelism  more  natural.  The  cost  of  our  richer  set  of  primitives  is  a 
more  elaborate  compiler,  as  discussed  below. 

The  ultimate  goal  of  our  research  is  to  show  that  constraint  languages  are  a 
practical  tool  for  programming  multi-processors  containing  several  hundred 
relatively  powerful  processors.  Achieving  this  goal  requires  solving  two  key 
problems.  The  first  is  to  determine  the  features  that  a  general  purpose  constraint 
language  should  have;  the  second  is  to  show  that  such  a  language  can  make 
effective  use  of  a  parallel  computer.  During  the  Winter  and  Spring  of  1986  we 
defined  a  language  called  CONSUL  that  demonstrates  our  solution  to  the  first 
problem.  We  are  now  conducting  a  series  of  experiments  intended  to  test  a 
primitive  dialect  of  CONSUL  on  a  variety  of  programs  and  to  characterize  the 
parallelism  that  it  makes  available  in  each.  These  experiments  will  support  our 
contention  that  CONSUL  is  suitable  for  general  purpose  programming,  and  will 
direct  us  to  the  richest  sources  of  parallelism  in  the  language.  A  later  phase  of  the 
CONSUL  project  will  address  the  second  problem  by  developing  compilers  that 
can  exploit  this  parallelism  on  a  real  multi-processor  (the  Butterfly). 

The  formal  foundation  for  CONSUL  is  axiomatic  set  theory.  Thus  the 
fundamental  data  type  is  the  set,  and  the  fundamental  operators  are  the  logical 
connectives  and  quantifiers.  However,  a  number  of  abstractions  are  built  into  the 
language  to  make  it  more  palatable  to  programmers  than  raw  set  theory.  In 
particular,  the  built-in  data  types  include  familiar  ones  such  as  sequences,  integers, 
characters,  et  cetera.  Each  of  these  types  can  be  given  a  set-theoretic  definition, 
but  programmers  generally  need  not  be  aware  of  it  One  consequence  of  the 
formal  basis  of  CONSUL  that  can  be  important  to  programmers,  however,  is  that 
relations,  being  sets,  can  be  treated  as  data,  and  vice  versa.  This  feature  allows  the 
language  to  include  higher-order  relations  in  a  natural  way.  Each  built-in  data  type 
is  associated  with  built-in  relations  that  correspond  to  common  operations  for  that 
type.  Thus  CONSUL  provides  simple  comparisons,  arithmetic  relations  between 
integers,  and  so  forth  as  language  “primitives”.  Again,  the  fact  that  these 
operations  are  not  really  primitive  to  the  underlying  set  theory  is  invisible  to  users. 
"Die  built-in  relations  can  be  composed  into  more  complex  ones  using  the  logical 
connectives  “and”,  “or”,  and  “not”,  with  their  standard  meanings,  and  the 
quantifier  “for  all”.  “For  all”  is  particularly  useful  as  a  way  of  mapping  relations 
over  sets  in  complex  ways.  (The  existential  quantifier  is  also  available,  but  is 
mainly  used  just  to  declare  variables  and  their  scopes.) 

Over  the  past  few  months  (Summer  1986)  we  have  been  developing  a 
software  system  that  lets  us  estimate  the  parallelism  available  from  CONSUL 
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programs.  This  system  consists  of  a  crude  interpreter  and  a  compactor.  The 
interpreter’s  main  purpose  is  to  note  when  each  relation  in  a  CONSUL  program 
can  be  satisfied  and  what  variables  are  defined  in  the  course  of  doing  so.  This 
information  is  written  to  a  trace  file,  which  is  later  compacted  into  a  maximally 
parallel  form  by  the  compactor.  Because  the  traces  are  taken  from  actual 
CONSUL  programs  in  execution,  the  parallelism  found  by  the  compactor  is 
“oracular”  (see  Nicolau  and  Fisher,  “Using  an  Oracle  to  Measure  Parallelism  in 
Single  Instruction  Stream  Programs”,  14th  ACM  SIGMICRO  Microprogramming 
Workshop,  Oct.  1981)  in  the  sense  that  a  real  compiler  could  fully  exploit  it  only  if 
it  had  perfect  information  about  the  object  program's  run-time  behavior.  Our 
results  thus  indicate  the  upper  bound  on  the  parallelism  that  can  be  derived  from 
CONSUL  programs.  In  the  course  of  developing  the  interpreter  (and  programs  to 
run  on  it)  we  have  also  considerably  refined  our  notion  of  the  proper  semantics  for 
CONSUL.  For  example,  until  early  August  the  mathematical  underpinnings  of 
CONSUL  were  only  intuitively  defined  ™  the  decision  to  unify  and  formalize 
them  set-theoretically  is  a  very  recent  one,  whose  full  implications  we  are  still 
evaluating. 

We  expect  to  complete  the  analysis  of  the  parallelism  experiments  by  the  end 
of  1986.  Starting  in  1987,  we  will  turn  our  efforts  to  compiling  CONSUL  into 
some  form  that  can  execute  on  a  Butterfly  (at  least  initially,  this  form  will  be  some 
high-level  language  with  its  own  Butterfly  compiler,  for  example.  Butterfly 
Common  Lisp,  LYNX,  or  C).  Designing  the  compiler  will  be  a  very  challenging 
project  In  designing  CONSUL,  we  have  deliberately  avoided  the  semantic 
compromises  made  to  ease  implementation  in  languages  like  Prolog.  Thus  the 
theoretical  complexity  of  solving  the  relations  in  a  CONSUL  program  is  worse 
than  in  related  languages.  None  the  less,  we  expect  that  we  can  produce  a  running, 
parallel  implementation  of  CONSUL.  Experience  with  Prolog  (supported  by  our 
own  experiences  as  we  begin  to  think  about  writing  real  CONSUL  programs) 
suggests  that  people  write  declarative  programs  in  fairly  stylized  ways  that  do  not 
push  the  theoretical  complexity  limits  of  the  solution  process.  Thus  we  do  not 
need  to  be  unduly  concerned  about  the  complexity  of  executing  a  CONSUL 
program.  We  believe  that  a  “smart"  compiler  (in  this  case  a  “smart”  compiler 
means  one  with  a  good  symbolic  algebra  facility)  can  compile  out  much  of  the 
searching  for  solutions  that  current  logic  languages  require.  Finally,  techniques  for 
satisfying  constraints  using  only  local  information  have  been  demonstrated  (see 
Steele,  “The  Definition  and  Implementation  of  a  Computer  Programming 
Language  Based  on  Constraints”,  Ph.  D.  Dissertation,  MIT  Dept,  of  Electrical 
Engineering  and  Computer  Science,  Aug.  1980),  and  we  hope  to  be  able  to  adapt 
these  techniques  for  use  in  CONSUL. 
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5.  Vision 

Much  of  our  vision  work  is  supported  under  the  DARPA  Image 
Understanding  Program,  but  SC  Vision  is  also  a  primary  funding  source  for  much 
of  our  vision  work.  Papers  in  the  references  by  Aloimonos,  Ballard,  Bandopadhay, 
Brown,  Cooper,  Hinkelman,  Hollbach,  Narayanan,  Sher,  and  Swain  give  details. 

The  work  of  two  recently  graduated  Ph.D.  students  merits  special  mention. 
Aloimonos'  work  has  centered  on  the  robust  and  reliable  computation  of  intrinsic 
images,  or  physical  parameters  of  the  scene.  He  has  invented  several  new 
techniques,  and  his  method  has  been  to  add  information  sources  rather  than  to 
rely  exclusively  on  apriori  constraints  (such  as  smoothness).  His  work  has  mainly 
been  in  the  domains  of  multiple  frame  vision  (stereo,  motion)  and  in  texture. 
Bandopadhay  has  also  been  working  in  the  domain  of  motion.  His  work  has  been 
to  apply  clustering  to  the  motion  segmentation  and  egomotion  problem,  and  to 
notice  that  proprioceptive  feedback  from  tracking  stationary  points  can  work  with 
vision  to  make  the  egomotion  calculations  easier.  This  tracking  work  is  the 
scientific  motivation  for  our  robotic  hardware,  which  consists  of  two  cameras  on  a 
"robot  head”.  With  this  setup  we  hope  to  investigate  real-time  vision  with  the 
Butterfly  hardware. 

5.1.  BIFF:  A  Butterfly  Vision  Library 

Tom  Olson  and  Liud  Bukys  have  constructed  a  parallel  version  of  the  IFF 
vision  library  written  at  the  University  of  British  Columbia  under  the  direction  of 
Prof.  Havens.  IFF  is  a  file  organization  for  images,  and  an  associated  set  of  image 
processing  and  vision  utilities,  something  like  SPIDER  or  GIPSY.  IFF  programs 
are  written  as  UNIX  filters,  and  the  system  uses  UNIX  pipes  to  concatenate 
operations.  This  is  a  slow  way  to  go  about  things  but  is  very  modular  and  good 
for  interactive  use.  BIFF,  the  parallel  version,  is  much  faster,  both  through 
capitalizing  on  the  innately  parallel  nature  of  many  low-level  vision  operations, 
and  through  use  of  the  large  memory  on  the  individual  butterfly  nodes  to  achieve 
"in-core"  files  that  can  be  passed  from  process  to  process  quickly  through  memory 
mapping.  We  expect  BIFF  to  be  a  useful  tool  and  to  expand  in  the  future  [Olson, 
BPR  in  preparation]. 

5.2.  Segmentation  on  the  BBN  Butterfly  Multiprocessor 

Tom  Olson  has  constructed  an  advanced  program  under  the  Uniform  System 
to  do  segmentation.  We  are  interested  in  the  general  problem  of  combining  the 
outputs  of  low-level  vision  processes  to  produce  robust  interpretations  of  large 
classes  of  input  images.  In  addition,  we  want  solutions  that  make  efficient  use  of 
large-scale  parallel  hardware.  In  order  to  study  these  issues  we  have  chosen  a 
particular  well-studied  problem  (2-d  segmentation)  for  implementation  on  the 
BBN  Butterfly  Multiprocessor.  To  date  we  have  been  more  concerned  with 
communications  and  systems  aspects  of  the  combination  than  with  the 
mathematical  aspects  of  cooperating  constraints  or  evidence  combination. 


Two  features  of  the  Uniform  System  are  particularly  important  for  the  design 
of  our  segmentation  system.  The  first  is  that  load  balancing  is  stochastic;  the 
machine  will  be  used  most  effectively  if  tasks  are  numerous  and  have  small 
execution  times  with  low  variance.  The  second  is  that  critical  Hata  structures  are 
kept  in  a  global  shared  memory  which  ignores  the  distinction  between  remote  and 
local  memory.  Since  individual  memories  have  limited  bandwidth,  it  is  advisable 
to  scatter  heavily  used  data  structures  as  randomly  as  possible  across  the  shared 
address  space.  Caching  local  copies  of  read-only  data  also  helps. 

5.2.1.  The  Segmentation  Problem 

The  segmentation  problem  has  a  number  of  characteristics  that  make  it  a 
good  vehicle  for  studying  integration  strategies.  It  can  be  approached  on  many 
levels,  from  low  level  (color,  texture  and  gradient)  to  intermediate  (shape)  and 
high  level  (semantic  checking  of  region  labels  and  geometric  relationships).  The 
literature  contains  a  large  number  of  algorithms  for  segmenting  based  on  this  or 
that  low-level  feature,  most  of  which  are  relatively  straightforward  to  implement. 
This  was  important  to  us  because  we  wanted  to  concentrate  on  systems  and 
integration  problems  rather  than  on  developing  innovative  low-level  processes. 

5.2.2.  Our  Approach 

Our  program  works  by  recursively  splitting  regions  until  all  regions  satisfy 
some  termination  criteria.  Users  of  the  system  must  provide  a  set  of  functions 
called  experts  which  take  as  their  argument  some  region  and  generate  a  proposed 
segmentation  of  that  region.  The  user  also  provides  a  reconciling  function  which 
integrates  a  set  of  proposed  segmentations  into  a  single  proposal,  which  the  main 
program  then  executes.  Users  are  responsible  for  parallelizing  their  own  functions. 
The  main  loop  of  the  program  can  be  summarized  as  follows  : 

agenda  :  =  original  image 
while  (agenda  is  not  empty) 

parallel-for  (every  region  on  agenda) 

parallel-for  (every  user-supplied  expert) 

apply  the  expert  to  generate  a  proposed  segmentation 
parallel-for  (every  region  on  agenda) 

apply  the  reconciler  to  eliminate  all  but  one  proposal 
if  #  proposals  is  zero,  put  the  region  on  a  terminal  list 
else  execute  the  proposal  and  put  results  on  agenda 
end 
end 

We  claim  that  with  minor  changes  a  large  class  of  currently  used  segmentation 
algorithms  can  be  fit  into  this  model.  Among  its  defects  are  that  a)  there  is  no 
provision  for  merging  and  b)  reasoning  based  on  more  than  one  region  (eg  based 
on  connectivity)  is  forbidden.  These  restrictions  are  unfortunate,  but  they  permit 


the  top-level  program  to  avoid  many  locking  and  concurrency  control  problems. 
5.2.3.  Progress  to  date 

Our  current  implementation  has  only  one  expert  function,  a.  grey-level 
histogram  splitter  loosely  based  on  PHOENIX,  the  multispectral  segmentor  of 
Shafer  and  Kanade1.  The  process  of  generating  a  proposed  segmentation  breaks 
down  into  the  following  stages  : 

1)  compute  the  grey-level  histogram  of  the  image 

2)  apply  heuristics  to  partition  the  histogram  into  a  set  of  intervals 

3)  back-project  the  intervals  onto  the  region  to  generate  a  set  of  bitmaps 

4)  perform  binary-image  smoothing  on  each  bitmap  to  eliminate  very  small  or 
very  thin  regions 

5)  find  four-connected  regions  in  each  bitmap  and  collect  them  into  a  list;  this  is 
the  proposal. 

Stages  1,  3,  and  4  have  been  made  highly  parallel,  essentially  by  performing 
parallel  FOR  loops  over  scan  lines  of  the  image  and  bitmaps.  Stage  2  is  serial  but 
depends  only  on  the  grey-scale  range  of  the  image,  and  is  in  practice  negligible. 
Stage  5  uses  a  serial  sequential  scan  algorithm,  applied  in  parallel  over  the  set  of 
bitmaps;  it  accounts  for  most  of  the  running  time  of  the  algorithm. 

Communication  between  manager,  experts  and  reconciler  is  through  shared 
memory.  An  expert  is  called  on  a  region  by  the  manager;  it  computes  a  proposed 
segmentation  and  stores  it  into  a  field  in  the  region  descriptor.  The  reconciler  is 
called  on  a  region  and  expects  to  find  a  (possibly  null)  list  of  proposals  in  the 
region  descriptor.  It  reduces  the  length  of  the  list  to  one  or  zero  and  returns;  the 
manager  then  executes  the  remaining  proposal,  if  any. 

The  manager  and  user  functions  are  written  in  C  under  the  Chrysalis 
operating  system,  using  the  Uniform  System  library  to  implement  parallel  loops. 
Users  provide  an  initialized  vector  of  pointers  to  the  expert  functions  and  the 
reconciler.  They  may  optionally  edit  the  region  descriptor  to  incorporate  any 
additional  fields  that  they  need.  If  the  region  descriptor  is  unchanged  the  manager 
will  not  need  to  be  recompiled,  but  relinking  is  always  necessary. 

The  manager  makes  use  of  BIFF,  a  locally  developed  image  processing 
library  for  the  Butterfly.  In  order  to  evaluate  our  parallelization  efforts  we  use  of 
modified  version  of  the  Uniform  System  that  provides  real-time  graphic  feedback 
on  the  status  of  the  Butterfly  processors  (working,  idle  or  blocked).  The  display 
portion  of  the  status  monitor  runs  on  a  SUN  workstation  and  communicates  with 
the  Butterfly  via  TCP. 

5.14.  Experience 

PARALLELIZATION  -  Our  experience  with  the  program  described  above 
has  been  that  Amdahl’s  Law  applies  with  a  vengeance;  that  is,  any  non-parallel 
segment  of  the  code  quickly  comes  to  dominate  the  running  time  of  the  system. 


Many  current  blackboard  designs  view  the  blackboard  as  a  server  executing 
requests  made  by  independent,  long-lived  client  processes,  and  we  agree  that  this 
model  will  be  more  manageable  than  ours  for  large  systems.  How  should  this 
model  be  parallelized?  Our  experience  suggests  that  parallelization  across  requests 
will  not  be  sufficient 

LOAD  BALANCING  -  Sophisticated  vision  blackboards  may  offer  to  do 
non-trivial  computations  for  their  clients  (eg  find  convex  hulls  of  point  sets,  as  in 
Stentz  and  Shafer2).  At  times,  therefor,  clients  may  be  idle  while  the  blackboard 
computes,  while  at  other  times  the  blackboard  may  be  idle  waiting  for  new 
requests.  In  a  multiprocessor  environment  this  may  lead  to  an  intolerable  waste  of 
cycles.  What  can  be  done  about  this? 

MODELS  OF  COMPUTATION  -  Answers  to  the  questions  above  depend 
heavily  on  what  model  of  computation  underlies  the  blackboard  and  its  clients. 
The  cooperating  sequential  process  model  provides  a  clean  way  to  express  the 
computation  but  in  our  opinion  often  fails  to  make  efficient  use  of  real  parallel 
hardware.  The  pseudo-SIMD  model  provided  by  the  Uniform  System  can  be 
quite  efficient  (though  it  may  not  be  extendable  to  large  numbers  of  processors  for 
hardware  reasons),  but  makes  writing  well-structured  programs  difficult  Is  there 
an  intermediate  solution,  or  can  the  defects  of  one  of  the  approaches  be  remedied? 

5.2.5.  Future  Work 

The  project  described  above  has  taught  us  quite  a  bit  and  we  intend  to  push 
it  somewhat  further,  at  least  to  the  point  of  incorporating  more  experts  and  a 
nontrivial  reconciling  function.  Ultimately,  however,  it  is  not  the  right  kind  of 
structure  for  a  complete  vision  system.  Segmentation,  as  many  people  have 
observed,  is  not  a  well-defined  problem;  you  cannot  say  whether  a  given 
segmentor  works  until  you  have  specified  what  you  want  to  do  with  its  output 

We  envision  a  segmentor  that  is  an  integral  part  of  a  system  for  constructing 
intrinsic  images.  Segmentation  could  certainly  use  the  outputs  of  e.g.  depth  maps, 
since  depth  discontinuities  should  signal  region  boundaries;  but  preliminary 
segmentations  can  also  help  identify  places  where  the  smoothness  assumptions 
used  in  many  intrinsic  image  calculations  break  down.  Our  ideal  system  would 
operate  on  stereo  pairs  of  color  images.  Monocular  processes  would  compute 
color  constancy  images  and  perform  preliminary  segmentation  into  plain  and 
textured  regions  on  the  basis  of  color  and  brightness.  The  preliminary 
segmentation  would  define  areas  over  which  smoothness  could  be  assumed;  the 
smoothness  assumption  would  then  be  used  to  constrain  the  stereo  correspondance 
search  problem  and  for  shape  from  texture  and  shading.  Reconciling  the  stereo, 
texture  and  shading  images  would  give  a  depth  map  that  could  then  be  used  to 
refine  the  segmentation. 
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6.  Massively  Parallel  Models 
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7.  Conclusion 

In  the  period  covered  by  this  progress  report  the  University  of  Rochester  has 
gone  a  long  way  toward  making  the  Butterfly  Parallel  Processor  into  a  usable 
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DARPA  contractor  in  terms  of  money,  but  DARPA  funding  has  been  well- 
leveraged  with  other  awards,  and  has  been  vital  to  our  ability  to  do  work  that  we 
believe  is  well-positioned  scientifically  and  practically  and  that  has  a  distinctive 
stamp. 
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